230 PART 5 Looking for Relationships with Correlation and Regression
Calculating the Sample Size You Need
To estimate how many data points you need for a regression analysis, first ask
yourself why you’re doing the regression:
» Do you want to show that the two variables are statistically significantly
associated? If so, you want to calculate the sample size required to achieve a
certain statistical power for the significance test (see Chapter 3 for an
introduction to statistical power).
» Do you want to estimate the value of the slope (or intercept) to within a
certain margin of error? If so, you want to calculate the sample size required
to achieve a certain precision in your estimate.
Testing the statistical significance of a slope is exactly equivalent to testing the
statistical significance of a correlation coefficient, so the sample-size calculations
are also the same for the two types of tests. If you haven’t already, check out
Chapter 15, which contains guidance and formulas to estimate how many
participants you need to test for any specified degree of correlation.
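As a rough illustration of this kind of calculation, the sketch below uses the widely known Fisher z approximation for the sample size needed to detect a correlation. It is not necessarily the same formula given in Chapter 15; the function name `n_for_correlation` and the defaults (two-sided alpha of 0.05, 80 percent power) are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate number of participants needed to detect a true
    correlation of size r with a two-sided test.

    Uses the Fisher z approximation (an assumption for this sketch):
        n = ((z_alpha + z_beta) / atanh(|r|))**2 + 3
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    fisher_z = math.atanh(abs(r))                  # Fisher z-transform of r
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


if __name__ == "__main__":
    for r in (0.1, 0.3, 0.5):
        print(f"r = {r}: about {n_for_correlation(r)} participants")
```

Because the slope test and the correlation test are equivalent in straight-line regression, the same estimate applies to testing whether the slope differs from zero. Smaller correlations require sharply larger samples.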
If you’re using regression to estimate the value of a regression coefficient — for
example, the slope of the straight line — then the sample-size calculations
become more complicated. The precision of the slope depends on several factors:
» The number of data points: More data points give you greater precision. SEs
vary inversely with the square root of the sample size. Equivalently, the
required sample size varies inversely with the square of the desired SE. So, if
you quadruple the sample size, you cut the SE in half. This is a very important
and generally applicable principle.
» Tightness of the fit of the observed points to the line: The closer the data
points hug the line, the more precisely you can estimate the regression
coefficients. The effect is directly proportional, in that twice as much Y-scatter
of the points produces twice as large an SE in the coefficients.
» How the data points are distributed across the range of the X variable:
This effect is hard to quantify, but in general, having the data points spread
out evenly over the entire range of X produces more precision than having
most of them clustered near the middle of the range.
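All three factors appear in the standard formula for the SE of the slope in straight-line regression: the residual standard deviation (the Y-scatter around the line) divided by the square root of the sum of squared deviations of X from its mean. The sketch below illustrates this with made-up X values; the function names, the two X designs, and the residual SD values are hypothetical, and in a real analysis the residual SD is estimated from the data rather than known in advance.

```python
import math


def slope_se(x, residual_sd):
    """SE of the slope in straight-line regression:
    residual_sd / sqrt(sum((x_i - mean_x)**2))."""
    mean_x = sum(x) / len(x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return residual_sd / math.sqrt(sxx)


def scaled_sample_size(n_current, se_current, se_target):
    """Rough sample size needed to reach se_target, assuming the SE
    shrinks with the square root of n and nothing else changes."""
    return math.ceil(n_current * (se_current / se_target) ** 2)


# Hypothetical X designs covering the same range (1 to 10)
spread = list(range(1, 11))                  # evenly spread over the range
clustered = [1, 5, 5, 5, 5, 6, 6, 6, 6, 10]  # most points near the middle

base = slope_se(spread, residual_sd=2.0)
quadrupled = slope_se(spread * 4, residual_sd=2.0)  # same X pattern, 4x the points
noisier = slope_se(spread, residual_sd=4.0)         # twice the Y-scatter
bunched = slope_se(clustered, residual_sd=2.0)      # same n, X bunched in the middle

if __name__ == "__main__":
    print(f"baseline SE:          {base:.4f}")
    print(f"4x the data points:   {quadrupled:.4f}")
    print(f"2x the Y-scatter:     {noisier:.4f}")
    print(f"X clustered mid-range: {bunched:.4f}")
    print("n to halve an SE of 0.5 from n = 20:",
          scaled_sample_size(20, 0.5, 0.25))
```

Quadrupling the number of points halves the SE, doubling the scatter doubles it, and bunching the X values near the middle of the range makes it larger than the evenly spread design with the same number of points.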
Given these factors, how do you strategically design a study and gather data for a
linear regression where you’re mainly interested in estimating a regression
coefficient to within a certain precision? One practical approach is to first conduct a
study that is small and underpowered, called a pilot study, to estimate the SE of the